knitr::opts_chunk$set(
  message = FALSE, 
  warning = FALSE, 
  tidy=FALSE,     # display code as typed
  size="small")   # slightly smaller font for code
options(digits = 3)

# default figure size
knitr::opts_chunk$set(
  fig.width=6.75, 
  fig.height=6.75,
  fig.align = "center"
)

1 Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries.

library(fivethirtyeight)
data(drinks)

We have one column with character data which contains country names and four columns with numerical data which contains the amount of each type of alcohol consumed in these countries. There are no missing variables.

glimpse(drinks)
## Rows: 193
## Columns: 5
## $ country                      <chr> "Afghanistan", "Albania", "Algeria", "An…
## $ beer_servings                <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, …
## $ spirit_servings              <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75…
## $ wine_servings                <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 19…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8,…
skim(drinks)
Data summary
Name drinks
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁
desc_beer<- drinks %>% 
  arrange(desc(beer_servings))

top25_beer<-head(desc_beer,25)

ggplot(top25_beer,aes(x=reorder(country,-beer_servings),y=beer_servings)) + geom_col() + labs(x="Country", y="Beer Servings", title="Top 25 Beer Consuming Countries") + theme(axis.text.x=element_text(angle=90)) + NULL

desc_wine<- drinks %>% 
  arrange(desc(wine_servings))

top25_wine<-head(desc_wine,25)

ggplot(top25_wine,aes(x=reorder(country,-wine_servings),y=wine_servings)) + geom_col() + labs(x="Country", y="Wine Servings", title="Top 25 Wine Consuming Countries") + theme(axis.text.x=element_text(angle=90)) + NULL

desc_spirit<- drinks %>% 
  arrange(desc(spirit_servings))

top25_spirit<-head(desc_spirit,25)

ggplot(top25_spirit,aes(x=reorder(country,-spirit_servings),y=spirit_servings)) + geom_col() + labs(x="Country", y="Spirit Servings", title="Top 25 Spirit Consuming Countries") + theme(axis.text.x=element_text(angle=90)) + NULL

In the first plot we can see that Namibia has the highest yearly consumption of beer per capita. Beer is one of Namibia’s most iconic commodity and their top quality brewing is a legacy of German colonialism.The second most beer-loving country is Czech Republic, known for its unique brewing methods and prevalent drinking culture. The second plot, representing wine consumption, shows that French and Portugese drink the most wine. It is not very suprising given these are one of the most well known wine producing countries in the world. Last but not least, looking at the third plot we can see that Grenada and Belarus are the most spirit-loving countries. We can safely assume that Grenadians prefer to drink rum while Belarusians enjoy their vodka.

2 Analysis of movies- IMDB dataset

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Aveng…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", …
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorro…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 2…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, …
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08,…
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08,…
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 92…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, …
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 3…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2,…
skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of facebook likes cast memebrs received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

2.1 Use your data import, inspection, and cleaning skills to answer the following:

  • Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries? duplicates
  • Produce a table with the count of movies by genre, ranked in descending order
  • Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order
  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
movies <- read_csv(here::here("data", "movies.csv"))

movies %>%
  group_by(genre) %>%
  count(sort=TRUE)
## # A tibble: 17 x 2
## # Groups:   genre [17]
##    genre           n
##    <chr>       <int>
##  1 Comedy        848
##  2 Action        738
##  3 Drama         498
##  4 Adventure     288
##  5 Crime         202
##  6 Biography     135
##  7 Horror        131
##  8 Animation      35
##  9 Fantasy        28
## 10 Documentary    25
## 11 Mystery        16
## 12 Sci-Fi          7
## 13 Family          3
## 14 Musical         2
## 15 Romance         2
## 16 Western         2
## 17 Thriller        1
moviesbygenre <- movies %>%
  group_by(genre)
summarise(moviesbygenre,mean(gross), mean(budget))
## # A tibble: 17 x 3
##    genre       `mean(gross)` `mean(budget)`
##    <chr>               <dbl>          <dbl>
##  1 Action          86583860.      71354888.
##  2 Adventure       95794257.      66290069.
##  3 Animation       98433792.      61701429.
##  4 Biography       45201805.      28543696.
##  5 Comedy          42630552.      24446319.
##  6 Crime           37502397.      26596169.
##  7 Documentary     17353973.       5887852.
##  8 Drama           37465371.      26242933.
##  9 Family         149160478.      14833333.
## 10 Fantasy         42408841.      17582143.
## 11 Horror          37713738.      13504916.
## 12 Musical         92084000        3189500 
## 13 Mystery         67533021.      39218750 
## 14 Romance         31264848.      25107500 
## 15 Sci-Fi          29788371.      27607143.
## 16 Thriller            2468         300000 
## 17 Western         20821884        3465000
movies_by_ratio <- movies %>%
  group_by(genre)
summarise(movies_by_ratio, mean_gross=mean(gross), mean_budget=mean(budget), return_on_budget=mean_gross/mean_budget) %>%
arrange(desc(return_on_budget))
## # A tibble: 17 x 4
##    genre       mean_gross mean_budget return_on_budget
##    <chr>            <dbl>       <dbl>            <dbl>
##  1 Musical      92084000     3189500          28.9    
##  2 Family      149160478.   14833333.         10.1    
##  3 Western      20821884     3465000           6.01   
##  4 Documentary  17353973.    5887852.          2.95   
##  5 Horror       37713738.   13504916.          2.79   
##  6 Fantasy      42408841.   17582143.          2.41   
##  7 Comedy       42630552.   24446319.          1.74   
##  8 Mystery      67533021.   39218750           1.72   
##  9 Animation    98433792.   61701429.          1.60   
## 10 Biography    45201805.   28543696.          1.58   
## 11 Adventure    95794257.   66290069.          1.45   
## 12 Drama        37465371.   26242933.          1.43   
## 13 Crime        37502397.   26596169.          1.41   
## 14 Romance      31264848.   25107500           1.25   
## 15 Action       86583860.   71354888.          1.21   
## 16 Sci-Fi       29788371.   27607143.          1.08   
## 17 Thriller         2468      300000           0.00823
movies_director <- movies %>%
  group_by(director) %>%
  summarise(total_gross = sum(gross, na.rm=TRUE), mean_gross=mean(gross, na.rm=TRUE), median_gross=median(gross,na.rm=TRUE),sd_gross=sd(gross, na.rm=TRUE),count = n()) %>%
  arrange(desc(total_gross))
top15_director <- head(movies_director,15)

top15_director
## # A tibble: 15 x 6
##    director          total_gross mean_gross median_gross   sd_gross count
##    <chr>                   <dbl>      <dbl>        <dbl>      <dbl> <int>
##  1 Steven Spielberg   4014061704 174524422.   164435221  101421051.    23
##  2 Michael Bay        2231242537 171634041.   138396624  127161579.    13
##  3 Tim Burton         2071275480 129454718.    76519172  108726924.    16
##  4 Sam Raimi          2014600898 201460090.   234903076  162126632.    10
##  5 James Cameron      1909725910 318287652.   175562880. 309171337.     6
##  6 Christopher Nolan  1813227576 226653447    196667606. 187224133.     8
##  7 George Lucas       1741418480 348283696    380262555  146193880.     5
##  8 Robert Zemeckis    1619309108 124562239.   100853835   91300279.    13
##  9 Clint Eastwood     1378321100  72543216.    46700000   75487408.    19
## 10 Francis Lawrence   1358501971 271700394.   281666058  135437020.     5
## 11 Ron Howard         1335988092 111332341    101587923   81933761.    12
## 12 Gore Verbinski     1329600995 189942999.   123207194  154473822.     7
## 13 Andrew Adamson     1137446920 284361730    279680930. 120895765.     4
## 14 Shawn Levy         1129750988 102704635.    85463309   65484773.    11
## 15 Ridley Scott       1128857598  80632686.    47775715   68812285.    14
movies_ratings <- movies %>%
  group_by(genre) %>%
  summarise(mean_rating=mean(rating, na.rm=TRUE), median_rating=median(rating, na.rm=TRUE), max_rating=max(rating, na.rm=TRUE), min_rating=min(rating, na.rm=TRUE), sd_rating=sd(rating, na.rm=TRUE))

movies_ratings
## # A tibble: 17 x 6
##    genre       mean_rating median_rating max_rating min_rating sd_rating
##    <chr>             <dbl>         <dbl>      <dbl>      <dbl>     <dbl>
##  1 Action             6.23          6.3         9          2.1     1.03 
##  2 Adventure          6.51          6.6         8.6        2.3     1.09 
##  3 Animation          6.65          6.9         8          4.5     0.968
##  4 Biography          7.11          7.2         8.9        4.5     0.760
##  5 Comedy             6.11          6.2         8.8        1.9     1.02 
##  6 Crime              6.92          6.9         9.3        4.8     0.849
##  7 Documentary        6.66          7.4         8.5        1.6     1.77 
##  8 Drama              6.73          6.8         8.8        2.1     0.917
##  9 Family             6.5           5.9         7.9        5.7     1.22 
## 10 Fantasy            6.15          6.45        7.9        4.3     0.959
## 11 Horror             5.83          5.9         8.5        3.6     1.01 
## 12 Musical            6.75          6.75        7.2        6.3     0.636
## 13 Mystery            6.86          6.9         8.5        4.6     0.882
## 14 Romance            6.65          6.65        7.1        6.2     0.636
## 15 Sci-Fi             6.66          6.4         8.2        5       1.09 
## 16 Thriller           4.8           4.8         4.8        4.8    NA    
## 17 Western            5.70          5.70        7.3        4.1     2.26
ggplot(data = movies, mapping = aes(x = rating)) + geom_histogram() + scale_y_log10() + facet_wrap(~ genre)

## Use ggplot to answer the following

  • Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?
ggplot(data = movies, mapping = aes(x = cast_facebook_likes, y = gross)) + geom_point() + geom_smooth(method=lm) +  scale_x_log10()  + scale_y_log10() + NULL

  • Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
ggplot(data = movies, mapping = aes(x = budget, y = gross)) + geom_point() + geom_smooth(method=lm) +  scale_x_log10()  + scale_y_log10() + NULL

  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?
ggplot(data = movies, mapping = aes(x = rating, y = gross)) + geom_point() + geom_smooth(method=lm) +  scale_x_log10()  + scale_y_log10() + NULL

3 Returns of financial stocks

You may find useful the material on finance data sources.

We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.

We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)
## Rows: 508
## Columns: 6
## $ symbol        <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", …
## $ name          <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie…
## $ ipo_year      <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "199…
## $ sector        <chr> "Health Care", "Consumer Durables", "Health Care", "Hea…
## $ industry      <chr> "Medical/Dental Instruments", "Electrical Products", "M…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasda…
skim(nyse)
Data summary
Name nyse
Number of rows 508
Number of columns 6
_______________________
Column type frequency:
character 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
symbol 0 1 1 4 0 508 0
name 0 1 5 48 0 505 0
ipo_year 0 1 3 4 0 33 0
sector 0 1 6 21 0 12 0
industry 0 1 5 62 0 103 0
summary_quote 0 1 31 34 0 508 0

Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order

companies_per_sector <- nyse %>%
  group_by(sector)%>%
  summarise(Number_of_Companies = n()) %>%
  arrange(desc(Number_of_Companies))

ggplot(data = companies_per_sector, mapping = aes(x=Number_of_Companies, y=reorder(sector,Number_of_Companies))) + labs(title="Sectors with number of companies in it \n",y="", x="Number of Companies") + geom_col() + theme_economist() + scale_x_continuous(expand = c(0,0))

companies_per_sector
## # A tibble: 12 x 2
##    sector                Number_of_Companies
##    <chr>                               <int>
##  1 Finance                                97
##  2 Consumer Services                      79
##  3 Public Utilities                       60
##  4 Capital Goods                          45
##  5 Health Care                            45
##  6 Energy                                 42
##  7 Technology                             40
##  8 Basic Industries                       39
##  9 Consumer Non-Durables                  31
## 10 Miscellaneous                          12
## 11 Transportation                         10
## 12 Consumer Durables                       8

Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).

# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("AAPL","JPM","DIS","DPZ","ANF","TSLA","XOM","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2020-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 19,448
## Columns: 8
## Groups: symbol [8]
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAP…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07,…
## $ open     <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, …
## $ high     <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, …
## $ low      <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, …
## $ close    <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, …
## $ volume   <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, …
## $ adjusted <dbl> 10.2, 10.2, 10.3, 10.3, 10.4, 10.6, 10.5, 10.6, 10.7, 10.8, …

Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  

#calculate monthly  returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))

Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

monthly_R <- myStocks_returns_monthly %>%
  summarise(mean_return=mean(monthly_returns, na.rm=TRUE), median_return=median(monthly_returns, na.rm=TRUE), max_return=max(monthly_returns, na.rm=TRUE), min_return=min(monthly_returns, na.rm=TRUE), sd_return=sd(monthly_returns, na.rm=TRUE))

monthly_R
## # A tibble: 8 x 6
##   symbol mean_return median_return max_return min_return sd_return
##   <chr>        <dbl>         <dbl>      <dbl>      <dbl>     <dbl>
## 1 AAPL      0.0249        0.0257        0.200     -0.181    0.0782
## 2 ANF      -0.000830     -0.000426      0.507     -0.421    0.139 
## 3 DIS       0.0142        0.0114        0.234     -0.179    0.0646
## 4 DPZ       0.0324        0.0253        0.342     -0.188    0.0746
## 5 JPM       0.0124        0.0210        0.172     -0.229    0.0719
## 6 SPY       0.0112        0.0146        0.127     -0.125    0.0381
## 7 TSLA      0.0513        0.0148        0.811     -0.224    0.171 
## 8 XOM      -0.000410      0.000105      0.224     -0.262    0.0591

Plot a density plot, using geom_density(), for each of the stocks

ggplot(data = myStocks_returns_monthly, mapping = aes(x=monthly_returns)) + labs(title="Monthly returns \n",y="", x="") + geom_density() + theme_economist() + facet_wrap(~ symbol)

ggplot(myStocks_returns_monthly, aes(x=monthly_returns)) +
  geom_density(fill = "darkgreen", alpha = 1) +
  labs(title = "Monthly returns of portfolio since 01 January 2017", x = "Monthly returns", y = "Density") +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8)) + 
  facet_wrap(~symbol, scales='fixed') +
  theme_clean()

What can you infer from this plot? Which stock is the riskiest? The least risky?

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock

library(ggrepel)
g_risk_return <- ggplot(data = monthly_R, mapping = aes(y=mean_return, x=sd_return, label = symbol)) + geom_point(color = "red")

g_2 <- g_risk_return + geom_text_repel() + labs(title = "Return vs. Risk", x="risk", y="return")

g_2

ggplot(monthly_R, aes(x = sd_return, y = mean_return, colour = symbol)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = symbol)) +
  labs(title = "Expected Monthly Return of Portfolio Stocks",
       y = "Expected return (Mean)", x = "Risk (SD)") +
  theme_clean() +
  theme(legend.position = 'none')

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

4 On your own: IBM HR Analytics

For this task, you will analyse a data set on Human Resoruce Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.

First let us load the data

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, …
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", …
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Trave…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358…
## $ Department               <chr> "Sales", "Research & Development", "Research…
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26,…
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3,…
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3,…
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", …
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, …
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2,…
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1,…
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "La…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3,…
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "M…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 26…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 996…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5,…
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes"…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, …
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3,…
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2,…
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0,…
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, …
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4,…
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3,…
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2,…
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0,…
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3,…

I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description

hr_cleaned <- hr_dataset %>% 
  clean_names() %>% 
  mutate(
    education = case_when(
      education == 1 ~ "Below College",
      education == 2 ~ "College",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Master",
      education == 5 ~ "Doctor"
    ),
    environment_satisfaction = case_when(
      environment_satisfaction == 1 ~ "Low",
      environment_satisfaction == 2 ~ "Medium",
      environment_satisfaction == 3 ~ "High",
      environment_satisfaction == 4 ~ "Very High"
    ),
    job_satisfaction = case_when(
      job_satisfaction == 1 ~ "Low",
      job_satisfaction == 2 ~ "Medium",
      job_satisfaction == 3 ~ "High",
      job_satisfaction == 4 ~ "Very High"
    ),
    performance_rating = case_when(
      performance_rating == 1 ~ "Low",
      performance_rating == 2 ~ "Good",
      performance_rating == 3 ~ "Excellent",
      performance_rating == 4 ~ "Outstanding"
    ),
    work_life_balance = case_when(
      work_life_balance == 1 ~ "Bad",
      work_life_balance == 2 ~ "Good",
      work_life_balance == 3 ~ "Better",
      work_life_balance == 4 ~ "Best"
    )
  ) %>% 
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)
count(hr_cleaned, attrition == "Yes")
## # A tibble: 2 x 2
##   `attrition == "Yes"`     n
##   <lgl>                <int>
## 1 FALSE                 1233
## 2 TRUE                   237
distribution_keyvalues <- hr_cleaned %>%
  summarise(mean_age=mean(age, na.rm=TRUE), median_age=median(age, na.rm=TRUE), max_age=max(age, na.rm=TRUE), min_age=min(age, na.rm=TRUE), sd_age=sd(age, na.rm=TRUE), mean_years_at_company=mean(years_at_company, na.rm=TRUE), median_years_at_company=median(years_at_company, na.rm=TRUE), max_years_at_company=max(years_at_company, na.rm=TRUE), min_years_at_company=min(years_at_company, na.rm=TRUE), sd_years_at_company=sd(years_at_company, na.rm=TRUE), mean_monthly_income=mean(monthly_income, na.rm=TRUE), median_monthly_income=median(monthly_income, na.rm=TRUE), max_monthly_income=max(monthly_income, na.rm=TRUE), min_monthly_income=min(monthly_income, na.rm=TRUE), sd_monthly_income=sd(monthly_income, na.rm=TRUE), mean_years_since_last_promotion=mean(years_since_last_promotion, na.rm=TRUE), median_years_since_last_promotion=median(years_since_last_promotion, na.rm=TRUE), max_years_since_last_promotion=max(years_since_last_promotion, na.rm=TRUE), min_years_since_last_promotion=min(years_since_last_promotion, na.rm=TRUE), sd_years_since_last_promotion=sd(years_since_last_promotion, na.rm=TRUE))


distribution_keyvalues
## # A tibble: 1 x 20
##   mean_age median_age max_age min_age sd_age mean_years_at_c… median_years_at…
##      <dbl>      <dbl>   <dbl>   <dbl>  <dbl>            <dbl>            <dbl>
## 1     36.9         36      60      18   9.14             7.01                5
## # … with 13 more variables: max_years_at_company <dbl>,
## #   min_years_at_company <dbl>, sd_years_at_company <dbl>,
## #   mean_monthly_income <dbl>, median_monthly_income <dbl>,
## #   max_monthly_income <dbl>, min_monthly_income <dbl>,
## #   sd_monthly_income <dbl>, mean_years_since_last_promotion <dbl>,
## #   median_years_since_last_promotion <dbl>,
## #   max_years_since_last_promotion <dbl>, min_years_since_last_promotion <dbl>,
## #   sd_years_since_last_promotion <dbl>
ggplot(hr_cleaned, aes(x=age)) +
  labs(title = "Boxplot - Age Distribution",  x = "Age") +
  geom_boxplot() +
  theme_classic()

ggplot(hr_cleaned,aes(x=years_at_company)) +
  labs(title = "Boxplot - Company tenure",  x = "Years at company") +
  geom_boxplot() +
  theme_classic()

ggplot(hr_cleaned,aes(x=monthly_income)) + 
  labs(title = "Boxplot - Monthly income",  x = "Income") +
  geom_boxplot() +
  theme_classic()

ggplot(hr_cleaned,aes(x=years_since_last_promotion)) +
  labs(title = "Boxplot - Years since last promotion",  x = "Years since last promotion") +
  geom_boxplot() +
  theme_classic()

count(hr_cleaned, job_satisfaction == "Low", job_satisfaction == "Medium", job_satisfaction == "High", job_satisfaction == "Very High")
## # A tibble: 4 x 5
##   `job_satisfaction… `job_satisfactio… `job_satisfactio… `job_satisfactio…     n
##   <lgl>              <lgl>             <lgl>             <lgl>             <int>
## 1 FALSE              FALSE             FALSE             TRUE                459
## 2 FALSE              FALSE             TRUE              FALSE               442
## 3 FALSE              TRUE              FALSE             FALSE               280
## 4 TRUE               FALSE             FALSE             FALSE               289
job_satis <- hr_cleaned %>%
  group_by(job_satisfaction) %>%
  count() %>%
  ungroup() %>%
  mutate(pct_job_satisfaction = 100*n/sum(n))

job_satis
## # A tibble: 4 x 3
##   job_satisfaction     n pct_job_satisfaction
##   <chr>            <int>                <dbl>
## 1 High               442                 30.1
## 2 Low                289                 19.7
## 3 Medium             280                 19.0
## 4 Very High          459                 31.2
ggplot(data = job_satis, mapping =aes(x = job_satisfaction, y = pct_job_satisfaction)) + geom_col() + NULL

count(hr_cleaned, work_life_balance == "Bad", work_life_balance == "Good", work_life_balance == "Better", work_life_balance == "Best")
## # A tibble: 4 x 5
##   `work_life_balanc… `work_life_balan… `work_life_balan… `work_life_balan…     n
##   <lgl>              <lgl>             <lgl>             <lgl>             <int>
## 1 FALSE              FALSE             FALSE             TRUE                153
## 2 FALSE              FALSE             TRUE              FALSE               893
## 3 FALSE              TRUE              FALSE             FALSE               344
## 4 TRUE               FALSE             FALSE             FALSE                80
hr_cleaned %>%
  group_by(work_life_balance) %>%
  count() %>%
  ungroup() %>%
  mutate(pct_work_life_balance = 100*n/sum(n))
## # A tibble: 4 x 3
##   work_life_balance     n pct_work_life_balance
##   <chr>             <int>                 <dbl>
## 1 Bad                  80                  5.44
## 2 Best                153                 10.4 
## 3 Better              893                 60.7 
## 4 Good                344                 23.4
ggplot(data = hr_cleaned, mapping = aes(x = monthly_income)) + geom_density() + facet_wrap(~ education) + NULL

ggplot(data = hr_cleaned, mapping = aes(x = monthly_income)) + geom_density() + facet_wrap(~ gender) + NULL

income_education <- hr_cleaned %>%
  group_by(education) %>%
  summarise(mean_income=mean(monthly_income, na.rm=TRUE), median_income=median(monthly_income, na.rm=TRUE), max_income=max(monthly_income, na.rm=TRUE), min_income=min(monthly_income, na.rm=TRUE), sd_income=sd(monthly_income, na.rm=TRUE))

income_education
## # A tibble: 5 x 6
##   education     mean_income median_income max_income min_income sd_income
##   <chr>               <dbl>         <dbl>      <dbl>      <dbl>     <dbl>
## 1 Bachelor            6517.         4762       19926       1081     4817.
## 2 Below College       5641.         3849       19973       1009     4485.
## 3 College             6227.         4892.      19613       1051     4525.
## 4 Doctor              8278.         6203       19586       2127     5061.
## 5 Master              6832.         5342.      19999       1359     4657.
income_gender <- hr_cleaned %>%
  group_by(gender) %>%
  summarise(mean_income=mean(monthly_income, na.rm=TRUE), median_income=median(monthly_income, na.rm=TRUE), max_income=max(monthly_income, na.rm=TRUE), min_income=min(monthly_income, na.rm=TRUE), sd_income=sd(monthly_income, na.rm=TRUE))

income_gender
## # A tibble: 2 x 6
##   gender mean_income median_income max_income min_income sd_income
##   <chr>        <dbl>         <dbl>      <dbl>      <dbl>     <dbl>
## 1 Female       6687.         5082.      19973       1129     4696.
## 2 Male         6381.         4838.      19999       1009     4715.
ggplot(data = hr_cleaned, mapping = aes(x = reorder(factor(job_role), -monthly_income), y = monthly_income)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90)) + NULL

income_bar_education <- hr_cleaned %>%
  group_by(education) %>%
  summarise(mean_income=mean(monthly_income, na.rm=TRUE))

income_bar_education
## # A tibble: 5 x 2
##   education     mean_income
##   <chr>               <dbl>
## 1 Bachelor            6517.
## 2 Below College       5641.
## 3 College             6227.
## 4 Doctor              8278.
## 5 Master              6832.
ggplot(data = income_bar_education, mapping = aes(x = education, y = mean_income)) +
  geom_col()

ggplot(data = hr_cleaned, mapping = aes(x = monthly_income)) + geom_density() + facet_wrap(~ education) + theme_solarized() + NULL

ggplot(data = hr_cleaned, mapping = aes(x = age, y = monthly_income)) + geom_point() + geom_smooth(method=lm) + facet_wrap(~ job_role) + NULL

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:

  1. How often do people leave the company (attrition)
  2. How are age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
  3. How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total
  4. Is there any relationship between monthly income and education? Monthly income and gender?
  5. Plot a boxplot of income vs job role. Make sure the highest-paid job roles appear first
  6. Calculate and plot a bar chart of the mean (or median?) income by education level.
  7. Plot the distribution of income by education level. Use a facet_wrap and a theme from ggthemes
  8. Plot income vs age, faceted by job_role

5 Challenge 1: Replicating a chart

The purpose of this exercise is to make a publication-ready plot using your dplyr and ggplot2 skills. Open the journal article “Riddell_Annals_Hom-Sui-Disparities.pdf”. Read the abstract and have a look at Figure 3. The data you need is “CDC_Males.csv”.

Don’t worry about replicating it exactly, try and see how far you can get. You’re encouraged to work together if you want to and exchange tips/tricks you figured out.

You may find these helpful:

## Rows: 104
## Columns: 28
## $ ST                      <chr> "AK", "AK", "AL", "AL", "AR", "AR", "AZ", "AZ…
## $ State                   <chr> "Alaska", "Alaska", "Alabama", "Alabama", "Ar…
## $ Population.Black        <dbl> 153664, 153664, 5376635, 5376635, 1991165, 19…
## $ Population.White        <dbl> 2274605, 2274605, 14267619, 14267619, 9747025…
## $ Deaths.homicide.Black   <dbl> 23, NA, 183, 1812, 667, 90, 300, 40, 3712, 43…
## $ Deaths.homicide.White   <dbl> 78, 10, 124, 620, 416, 110, 586, 145, 1399, 6…
## $ crude.homicide.Black    <dbl> 14.97, NA, 3.40, 33.70, 33.50, 4.52, 22.06, 2…
## $ crude.homicide.White    <dbl> 3.43, NA, 0.87, 4.35, 4.27, 1.13, 3.46, 0.86,…
## $ adjusted.homicide.Black <dbl> 12.30, NA, 3.51, 33.00, 33.39, 4.81, 20.40, 2…
## $ adjusted.homicide.White <dbl> 3.24, NA, 0.85, 4.47, 4.39, 1.13, 3.63, 0.91,…
## $ Deaths.suicide.Black    <dbl> 23, 12, 148, 370, 118, 75, 101, 80, 492, 639,…
## $ Deaths.suicide.White    <dbl> 535, 196, 1222, 3195, 2154, 1027, 4146, 1972,…
## $ crude.suicide.Black     <dbl> 14.97, NA, 2.75, 6.88, 5.93, 3.77, 7.43, 5.88…
## $ crude.suicide.White     <dbl> 23.52, 8.62, 8.56, 22.39, 22.10, 10.54, 24.47…
## $ adjusted.suicide.Black  <dbl> 12.77, NA, 2.84, 7.20, 6.25, 3.93, 7.51, 5.58…
## $ adjusted.suicide.White  <dbl> 23.47, 8.21, 8.64, 20.97, 20.98, 10.89, 21.42…
## $ type                    <chr> "Firearm", "Non-Firearm", "Non-Firearm", "Fir…
## $ crude.RD.suicide        <dbl> -8.55, NA, -5.81, -15.51, -16.17, -6.77, -17.…
## $ adj.RD.suicide          <dbl> -10.70, NA, -5.80, -13.77, -14.73, -6.96, -13…
## $ crude.RD.homicide       <dbl> 11.54, NA, 2.53, 29.35, 29.23, 3.39, 18.60, 2…
## $ adj.RD.homicide         <dbl> 9.06, NA, 2.66, 28.53, 29.00, 3.68, 16.77, 1.…
## $ ST.order.RD.homicide    <chr> "AK", "AK", "AL", "AL", "AR", "AR", "AZ", "AZ…
## $ ST.order.RD.suicide     <chr> "AK", "AK", "AL", "AL", "AR", "AR", "AZ", "AZ…
## $ gun.house.prev          <dbl> 59.8, 59.8, 52.2, 52.2, 58.8, 58.8, 32.3, 32.…
## $ gun.house.prev.category <chr> "45.0%-65.5%", "45.0%-65.5%", "45.0%-65.5%", …
## $ average.pop.white       <dbl> 252734, 252734, 1585291, 1585291, 1083003, 10…
## $ average.pop.black       <dbl> 17074, 17074, 597404, 597404, 221241, 221241,…
## $ type.fac                <chr> "Firearm-related", "Firearm-unrelated", "Fire…
## Rows: 50
## Columns: 3
## $ ST                      <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE…
## $ average.pop.white       <dbl> 252734, 1585291, 1083003, 1882345, 7703022, 1…
## $ gun.house.prev.category <chr> "45.0%-65.5%", "45.0%-65.5%", "45.0%-65.5%", …
## Rows: 52
## Columns: 3
## $ ST                      <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC…
## $ adjusted.homicide.White <dbl> 3.24, 4.47, 4.39, 3.63, 2.05, 1.75, 0.86, NA,…
## $ adjusted.suicide.White  <dbl> 23.47, 20.97, 20.98, 21.42, 11.63, 19.48, 6.4…
## Rows: 50
## Columns: 5
## $ ST                      <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE…
## $ adjusted.homicide.White <dbl> 3.24, 4.47, 4.39, 3.63, 2.05, 1.75, 0.86, 1.7…
## $ adjusted.suicide.White  <dbl> 23.47, 20.97, 20.98, 21.42, 11.63, 19.48, 6.4…
## $ average.pop.white       <dbl> 252734, 1585291, 1083003, 1882345, 7703022, 1…
## $ gun.house.prev.category <chr> "45.0%-65.5%", "45.0%-65.5%", "45.0%-65.5%", …

6 Challenge 2: 2016 California Contributors plots

As discussed in class, I would like you to reproduce the plot that shows the top ten cities in highest amounts raised in political contributions in California during the 2016 US Presidential election.

To get this plot, you must join two dataframes; the one you have with all contributions, and data that can translate zipcodes to cities. You can find a file with all US zipcodes, e.g., here http://www.uszipcodelist.com/download.html.

The easiest way would be to create two plots and then place one next to each other. For this, you will need the patchwork package. https://cran.r-project.org/web/packages/patchwork/index.html

While this is ok, what if one asked you to create the same plot for the top 10 candidates and not just the top two? The most challenging part is how to reorder within categories, and for this you will find Julia Silge’s post on REORDERING AND FACETTING FOR GGPLOT2 useful.

# Make sure you use vroom() as it is significantly faster than read.csv()
CA_contributors_2016 <- vroom::vroom(here::here("data","CA_contributors_2016.csv"))

7 Deliverables

There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

8 Details

  • Who did you collaborate with: TYPE NAMES HERE
  • Approximately how much time did you spend on this problem set: ANSWER HERE
  • What, if anything, gave you the most trouble: ANSWER HERE

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

9 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.